RGB video
Two-Stream Network for Sign Language Recognition and Translation
Sign languages are visual languages that use manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connections, a sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model, called TwoStream-SLR, performs sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network. Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art performance on SLR and SLT tasks across a series of datasets, including Phoenix-2014, Phoenix-2014T, and CSL-Daily.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York (0.04)
- Europe > Greece (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
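The dual-encoder idea in the abstract above (an RGB stream and a keypoint stream coupled by bidirectional lateral connections) can be illustrated with a minimal sketch. This is a hypothetical PyTorch illustration, not the authors' TwoStream-SLR code; the layer sizes, the `rgb_feats`/`kpt_feats` names, and the use of simple linear projections for the lateral connections are assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """One stage of a dual encoder: an RGB stream and a keypoint stream
    exchange information through bidirectional lateral connections.
    Hypothetical sketch; dimensions and projection choices are assumptions."""
    def __init__(self, dim_rgb=256, dim_kpt=128):
        super().__init__()
        self.rgb_layer = nn.Sequential(nn.Linear(dim_rgb, dim_rgb), nn.ReLU())
        self.kpt_layer = nn.Sequential(nn.Linear(dim_kpt, dim_kpt), nn.ReLU())
        # Lateral connections: project one stream's features into the other's width.
        self.kpt_to_rgb = nn.Linear(dim_kpt, dim_rgb)
        self.rgb_to_kpt = nn.Linear(dim_rgb, dim_kpt)

    def forward(self, rgb, kpt):
        # rgb: (batch, frames, dim_rgb); kpt: (batch, frames, dim_kpt)
        rgb_out = self.rgb_layer(rgb) + self.kpt_to_rgb(kpt)   # keypoints -> RGB
        kpt_out = self.kpt_layer(kpt) + self.rgb_to_kpt(rgb)   # RGB -> keypoints
        return rgb_out, kpt_out

# Toy usage: per-frame RGB features and keypoint features for a 16-frame clip.
rgb_feats = torch.randn(2, 16, 256)   # e.g. from a video backbone
kpt_feats = torch.randn(2, 16, 128)   # e.g. from an off-the-shelf keypoint estimator
rgb_feats, kpt_feats = TwoStreamBlock()(rgb_feats, kpt_feats)
fused = torch.cat([rgb_feats, kpt_feats], dim=-1)  # joint representation for SLR/SLT heads
print(fused.shape)  # torch.Size([2, 16, 384])
```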
HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos
Weng, Haoyang, Li, Yitang, Sobanbabu, Nikhil, Wang, Zihan, Luo, Zhengyi, He, Tairan, Ramanan, Deva, Shi, Guanya (Robotics Institute, Carnegie Mellon University, USA)
Figure 1: HDMI enables humanoid robots to acquire diverse whole-body interaction skills directly from human videos.
Enabling robust whole-body humanoid-object interaction (HOI) remains challenging due to the scarcity of motion data and the contact-rich nature of such tasks. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets, and (ii) trains a reinforcement learning policy ... Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos. Humanoid robots hold immense potential for assisting humans in diverse environments due to their human-like morphology and versatility.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia (0.04)
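The first stage of the pipeline above turns extracted human and object trajectories into structured motion datasets. A minimal sketch of what such a record might look like is given below; the field names, shapes, and file name are purely illustrative assumptions, not the format used by HDMI.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionClip:
    """One retargeted human-object interaction clip (illustrative fields only)."""
    robot_joint_targets: np.ndarray   # (T, num_joints) retargeted humanoid joint angles
    object_pose: np.ndarray           # (T, 7) object position (xyz) + orientation (quaternion)
    contact_flags: np.ndarray         # (T,) whether hand-object contact is active
    source_video: str                 # path or URL of the monocular RGB source video

# A dummy 100-frame clip for a 29-DoF humanoid, as a placeholder example.
clip = InteractionClip(
    robot_joint_targets=np.zeros((100, 29)),
    object_pose=np.zeros((100, 7)),
    contact_flags=np.zeros(100, dtype=bool),
    source_video="example_door_opening.mp4",
)
print(clip.robot_joint_targets.shape)  # (100, 29)
```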
Human Action Anticipation: A Survey
Lai, Bolin, Toyer, Sam, Nagarajan, Tushar, Girdhar, Rohit, Zha, Shengxin, Rehg, James M., Kitani, Kris, Grauman, Kristen, Desai, Ruta, Liu, Miao
Predicting future human behavior is an increasingly popular topic in computer vision, driven by interest in applications such as autonomous vehicles, digital assistants, and human-robot interaction. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, and goal prediction. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves not only as a reference for contemporary methodologies in action anticipation, but also as a guide for future research directions in this evolving landscape.
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > California > Alameda County > Berkeley (0.14)
- (3 more...)
- Research Report (1.00)
- Overview (1.00)
- Education (1.00)
- Leisure & Entertainment (0.93)
- Media (0.93)
- Health & Medicine (0.67)
Taylor Videos for Action Recognition
Wang, Lei, Yuan, Xiuyuan, Gedeon, Tom, Zheng, Liang
Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) have various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. To address these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, which we call Taylor frames. The Taylor video is named after the Taylor series, which approximates a function at a given point using important terms. In the scenario of videos, we define an implicit motion-extraction function that aims to extract motions from a temporal block of video. Within this block, using the frames, the difference frames, and higher-order difference frames, we perform a Taylor expansion to approximate this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series yields dominant motion patterns, in which static objects and small, unstable motions are removed. Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvements are achieved.
- Oceania > Australia > Western Australia > Perth (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (2 more...)
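The construction described in the abstract above (difference frames and higher-order difference frames combined as terms of a Taylor expansion around the starting frame) can be sketched as follows. This is a hedged NumPy illustration of the general idea: the factorial weighting and the choice to sum only the order-one-and-above terms follow the abstract's description, but the exact formulation in the paper may differ.

```python
import numpy as np
from math import factorial

def taylor_frame(block: np.ndarray, max_order: int = 3) -> np.ndarray:
    """Approximate a 'Taylor frame' for a temporal block of frames.

    block: (T, H, W) grayscale frames of one temporal block.
    Higher-order difference frames stand in for derivatives at the starting
    frame; summing the order >= 1 terms keeps dominant motion and suppresses
    static content (illustrative formulation, not the paper's exact one).
    """
    diffs = block.astype(np.float64)
    terms = []
    for n in range(1, max_order + 1):
        diffs = np.diff(diffs, axis=0)          # n-th order difference frames
        terms.append(diffs[0] / factorial(n))   # derivative term at the starting frame
    return np.sum(terms, axis=0)

# Toy usage: a block of 8 random 64x64 frames.
block = np.random.rand(8, 64, 64)
print(taylor_frame(block).shape)  # (64, 64)
```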
CP-AGCN: Pytorch-based Attention Informed Graph Convolutional Network for Identifying Infants at Risk of Cerebral Palsy
Zhang, Haozheng, Ho, Edmond S. L., Shum, Hubert P. H.
There are about 2 to 3 cerebral palsy (CP) patients per 1,000 children in the UK [1], a rate similar to that of other developed countries. Although CP cannot be completely cured at present, early prediction of CP and intervention are considered a paramount part of treatment. Current clinical early prediction of CP relies on the General Movement Assessment (GMA) [2]. GMA can be performed in person by trained GM assessors, or by watching an RGB video that records the infant's general movements. However, GMA training is time- and resource-consuming, making it challenging to cope with the high demand for CP prediction. To tackle this problem, we propose automating this process by analyzing the general movements of infants from RGB videos. This allows early prediction to cover even the lower-risk population. Motivated by the encouraging results reported in recent research based on skeletal data [3, 4, 5, 6, 7, 8, 9, 10], the 2D joint locations of the infant are extracted from RGB videos as the input to the system for CP prediction. The computational intelligence of our system is implemented with a graph convolutional network, a kind of deep artificial neural network that models relational data well, making it suitable for skeleton data.
- Europe > United Kingdom > Wales (0.04)
- Europe > United Kingdom > England (0.04)
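The abstract above describes feeding 2D joint locations extracted from RGB video into a graph convolutional network. A minimal sketch of a single graph convolution over skeleton joints is shown below; the 18-joint chain skeleton, the normalized-adjacency formulation, and all dimensions are assumptions for illustration, not CP-AGCN's actual architecture (its attention mechanism is omitted entirely).

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """One graph convolution over skeleton joints: X' = A_hat @ X @ W (illustrative only)."""
    def __init__(self, in_features: int, out_features: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
        a = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        self.register_buffer("a_hat", d_inv_sqrt @ a @ d_inv_sqrt)
        self.weight = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, in_features), e.g. per-joint 2D coordinates.
        return torch.relu(self.a_hat @ self.weight(x))

# Toy usage: 18 joints with (x, y) coordinates and a chain-shaped dummy skeleton graph.
num_joints = 18
adjacency = torch.zeros(num_joints, num_joints)
for i in range(num_joints - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
layer = SkeletonGraphConv(in_features=2, out_features=16, adjacency=adjacency)
joints_2d = torch.randn(4, num_joints, 2)  # a batch of 4 skeleton frames
print(layer(joints_2d).shape)  # torch.Size([4, 18, 16])
```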
Get 3D models of multiple objects in RGB video with RayTran
RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers. arXiv abstract: https://arxiv.org/abs/2203.13296 | arXiv PDF: https://arxiv.org/pdf/2203.13296.pdf ... propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos ... represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids ...
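The excerpt above mentions two ways the network represents its knowledge: a global 3D grid of features and an array of view-specific 2D grids. The tensor shapes below are a minimal illustration of that split; the grid resolutions and feature width are invented for the example and are not RayTran's actual settings.

```python
import numpy as np

num_views, feat_dim = 8, 64
global_3d_grid = np.zeros((32, 32, 32, feat_dim))        # one shared volumetric feature grid
view_2d_grids = np.zeros((num_views, 60, 80, feat_dim))  # one 2D feature grid per input view
print(global_3d_grid.shape, view_2d_grids.shape)
```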
AIR-Act2Act: Human-human interaction dataset for teaching non-verbal social behaviors to robots
Ko, Woo-Ri, Jang, Minsu, Lee, Jaeyeon, Kim, Jaehong
To better interact with users, a social robot should understand the users' behavior, infer their intention, and respond appropriately. Machine learning is one way of implementing robot intelligence: it provides the ability to automatically learn and improve from experience instead of explicitly telling the robot what to do. Social skills can also be learned by watching human-human interaction videos. However, human-human interaction datasets covering the various situations in which interactions occur are relatively scarce. Moreover, we aim to use service robots in the elderly-care domain, yet no interaction dataset has been collected for this domain. For this reason, we introduce a human-human interaction dataset for teaching non-verbal social behaviors to robots. It is the only interaction dataset in which elderly people have participated as performers. We recruited 100 elderly people and two college students to perform 10 interactions in an indoor environment. The entire dataset has 5,000 interaction samples, each of which contains depth maps, body indexes, and 3D skeletal data captured with three Microsoft Kinect v2 cameras. In addition, we provide the joint angles of a humanoid NAO robot, converted from the human behaviors that robots need to learn. The dataset and useful Python scripts are available for download at https://github.com/ai4r/AIR-Act2Act. It can be used not only to teach social skills to robots but also to benchmark action recognition algorithms.
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Asia > South Korea > Daejeon > Daejeon (0.04)
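A sample in the dataset described above bundles depth maps, body indexes, and 3D skeletal data from three Kinect v2 cameras together with converted NAO joint angles. The loader below is a hypothetical sketch of how such a sample might be organized; the file names, array layout, and helper function are assumptions, not the actual scripts provided at https://github.com/ai4r/AIR-Act2Act.

```python
import numpy as np

def load_interaction_sample(sample_dir: str) -> dict:
    """Hypothetical loader for one AIR-Act2Act-style interaction sample.
    File names and shapes are illustrative assumptions only."""
    num_cameras = 3  # three Microsoft Kinect v2 viewpoints per sample
    return {
        "depth_maps":   [np.load(f"{sample_dir}/depth_cam{i}.npy") for i in range(num_cameras)],
        "body_indexes": [np.load(f"{sample_dir}/body_cam{i}.npy") for i in range(num_cameras)],
        "skeletons_3d": [np.load(f"{sample_dir}/skel_cam{i}.npy") for i in range(num_cameras)],
        # NAO joint angles converted from the human behavior, shape (frames, num_joints).
        "nao_joint_angles": np.load(f"{sample_dir}/nao_angles.npy"),
    }
```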